Toxic Comment Filter

BiLSTM model for multi label classification
code
Deep Learning
Python, R
Author

Simone Brazzi

Published

August 12, 2024

1 Introduction

  • Build a model that filters user comments according to how harmful their language is.
  • Preprocess the text by removing the tokens that carry no meaningful semantic contribution.
  • Transform the text corpus into sequences.
  • Build a Deep Learning model with recurrent layers for a multilabel classification task.

At prediction time, the model must return a vector containing a 1 or a 0 for each label in the dataset (toxic, severe_toxic, obscene, threat, insult, identity_hate). A harmless comment is therefore classified by a vector of all 0s, [0,0,0,0,0,0], whereas a harmful comment has at least one 1 among the 6 labels.
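
As a minimal sketch of this output format, per-label probabilities can be turned into the binary vector as follows. The probabilities, the 0.5 cut-off and the helper name to_label_vector are illustrative assumptions, not values or code from the model.

```python
# Label order as listed in the dataset.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_label_vector(probabilities, threshold=0.5):
    """Map one probability per label to a 0/1 vector (threshold is an assumed cut-off)."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical model outputs for two comments.
clean_comment = to_label_vector([0.02, 0.00, 0.01, 0.00, 0.03, 0.01])
toxic_comment = to_label_vector([0.91, 0.12, 0.77, 0.04, 0.64, 0.08])

print(clean_comment)  # [0, 0, 0, 0, 0, 0]
print(toxic_comment)  # [1, 0, 1, 0, 1, 0]

# Names of the labels that were switched on for the second comment.
print([label for label, flag in zip(LABELS, toxic_comment) if flag])
```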

2 Setup

Leveraging Quarto and RStudio, I will set up an R and Python environment.

2.1 Import R libraries

Import R libraries. These are used both for rendering the document and for data analysis, because I prefer ggplot2 over matplotlib. I will also use colorblind-safe palettes.

Code
library(tidyverse, verbose = FALSE)
library(tidymodels, verbose = FALSE)
library(reticulate)
library(ggplot2)
library(plotly)
library(RColorBrewer)
library(bslib)
library(Metrics)

reticulate::use_virtualenv("r-tf")

2.2 Import Python packages

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras
import keras_nlp

from keras.backend import clear_session
from keras.models import Model, load_model
from keras.layers import TextVectorization, Input, Dense, Embedding, Dropout, GlobalAveragePooling1D, LSTM, Bidirectional, GlobalMaxPool1D, Flatten, Attention
from keras.metrics import Precision, Recall, AUC, SensitivityAtSpecificity, SpecificityAtSensitivity, F1Score


from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import multilabel_confusion_matrix, classification_report, ConfusionMatrixDisplay, precision_recall_curve, f1_score, recall_score, roc_auc_score

Create a Config class to store all the useful parameters for the model and for the project.

2.3 Class Config

I created a class with all the basic configuration of the model, to improve the readability.

Code
class Config():
    def __init__(self):
        self.url = "https://s3.eu-west-3.amazonaws.com/profession.ai/datasets/Filter_Toxic_Comments_dataset.csv"
        self.max_tokens = 20000
        self.output_sequence_length = 911 # check the analysis done to establish this value
        self.embedding_dim = 128
        self.batch_size = 32
        self.epochs = 100
        self.temp_split = 0.3
        self.test_split = 0.5
        self.random_state = 42
        self.total_samples = 159571 # total train samples
        self.train_samples = 111699
        self.val_samples = 23936
        self.features = 'comment_text'
        self.labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
        self.new_labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', "clean"]
        self.label_mapping = {label: i for i, label in enumerate(self.labels)}
        self.new_label_mapping = {label: i for i, label in enumerate(self.new_labels)}
        self.path = "/Users/simonebrazzi/R/blog/posts/toxic_comment_filter/history/f1score/"
        self.model =  self.path + "model_f1.keras"
        self.checkpoint = self.path + "checkpoint.lstm_model_f1.keras"
        self.history = self.path + "lstm_model_f1.xlsx"
        
        self.metrics = [
            Precision(name='precision'),
            Recall(name='recall'),
            AUC(name='auc', multi_label=True, num_labels=len(self.labels)),
            F1Score(name="f1", average="macro")
            
        ]
    def get_early_stopping(self):
        early_stopping = keras.callbacks.EarlyStopping(
            monitor="val_f1", # "val_recall",
            min_delta=0.2,
            patience=10,
            verbose=0,
            mode="max",
            restore_best_weights=True,
            start_from_epoch=3
        )
        return early_stopping

    def get_model_checkpoint(self, filepath):
        model_checkpoint = keras.callbacks.ModelCheckpoint(
            filepath=filepath,
            monitor="val_f1", # "val_recall",
            verbose=0,
            save_best_only=True,
            save_weights_only=False,
            mode="max",
            save_freq="epoch"
        )
        return model_checkpoint

    def find_optimal_threshold_cv(self, ytrue, yproba, metric, thresholds=np.arange(.05, .35, .05), n_splits=7):

      # instantiate KFold
      kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
      threshold_scores = []

      for threshold in thresholds:

        cv_scores = []
        for train_index, val_index in kf.split(ytrue):

          ytrue_val = ytrue[val_index]
          yproba_val = yproba[val_index]

          ypred_val = (yproba_val >= threshold).astype(int)
          score = metric(ytrue_val, ypred_val, average="macro")
          cv_scores.append(score)

        mean_score = np.mean(cv_scores)
        threshold_scores.append((threshold, mean_score))

      # Find the threshold with the highest mean score (once, after the loop)
      best_threshold, best_score = max(threshold_scores, key=lambda x: x[1])
      return best_threshold, best_score

config = Config()
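
The idea behind find_optimal_threshold_cv can be illustrated with a dependency-free sketch. It is a simplification: it skips the KFold averaging and scores each threshold once on a made-up toy set. The helpers f1_binary and macro_f1 are hypothetical names introduced only for this example.

```python
def f1_binary(y_true, y_pred):
    """F1 for one label; 0.0 when there are no true or predicted positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_proba, threshold):
    """Average the per-label F1 after binarising probabilities at `threshold`."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        true_j = [row[j] for row in y_true]
        pred_j = [1 if row[j] >= threshold else 0 for row in y_proba]
        scores.append(f1_binary(true_j, pred_j))
    return sum(scores) / n_labels


# Toy data: 4 comments, 2 labels (made up for illustration).
y_true = [[1, 0], [0, 0], [1, 1], [0, 1]]
y_proba = [[0.30, 0.05], [0.08, 0.02], [0.22, 0.40], [0.12, 0.18]]

# Same grid as the default in the Config method, written out explicitly.
thresholds = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
scored = [(t, macro_f1(y_true, y_proba, t)) for t in thresholds]
best_threshold, best_score = max(scored, key=lambda pair: pair[1])

print(best_threshold, best_score)  # 0.15 1.0
```

With imbalanced labels the best cut-off is often well below 0.5, which is presumably why the default grid in the class stays under 0.35.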

3 Data

The dataset is downloaded from the URL with tf.keras.utils.get_file. N.B. For reproducibility, I also saved a local copy of the dataset, because the link has been unavailable at times.

Code
# df = pd.read_csv(config.path)
file = tf.keras.utils.get_file("Filter_Toxic_Comments_dataset.csv", config.url)
df = pd.read_csv(file)
Code
library(reticulate)

py$df %>%
  tibble() %>% 
  head(5)
Table 1: First 5 elements
# A tibble: 5 × 8
  comment_text            toxic severe_toxic obscene threat insult identity_hate
  <chr>                   <dbl>        <dbl>   <dbl>  <dbl>  <dbl>         <dbl>
1 "Explanation\nWhy the …     0            0       0      0      0             0
2 "D'aww! He matches thi…     0            0       0      0      0             0
3 "Hey man, I'm really n…     0            0       0      0      0             0
4 "\"\nMore\nI can't mak…     0            0       0      0      0             0
5 "You, sir, are my hero…     0            0       0      0      0             0
# ℹ 1 more variable: sum_injurious <dbl>

Let's create a clean variable for EDA purposes: I want to see visually how many observations are clean compared with the other labels.

Code
df.loc[df.sum_injurious == 0, "clean"] = 1
df.loc[df.sum_injurious != 0, "clean"] = 0

3.1 EDA

First a check on the dataset to find possible missing values and imbalances.

3.1.1 Frequency

Code
library(reticulate)
df_r <- py$df
new_labels_r <- py$config$new_labels

df_r_grouped <- df_r %>% 
  select(all_of(new_labels_r)) %>%
  pivot_longer(
    cols = all_of(new_labels_r),
    names_to = "label",
    values_to = "value"
  ) %>% 
  group_by(label) %>%
  summarise(count = sum(value)) %>% 
  mutate(freq = round(count / sum(count), 4))

df_r_grouped
Table 2: Absolute and relative labels frequency
# A tibble: 7 × 3
  label          count   freq
  <chr>          <dbl>  <dbl>
1 clean         143346 0.803 
2 identity_hate   1405 0.0079
3 insult          7877 0.0441
4 obscene         8449 0.0473
5 severe_toxic    1595 0.0089
6 threat           478 0.0027
7 toxic          15294 0.0857

3.1.2 Barchart

Code
library(reticulate)
barchart <- df_r_grouped %>%
  ggplot(aes(x = reorder(label, count), y = count, fill = label)) +
  geom_col() +
  labs(
    x = "Labels",
    y = "Count"
  ) +
  # sort bars in descending order
  scale_x_discrete(limits = df_r_grouped$label[order(df_r_grouped$count, decreasing = TRUE)]) +
  scale_fill_brewer(type = "seq", palette = "RdYlBu") +
  theme_minimal()
ggplotly(barchart)
Figure 1: Imbalance in the dataset with clean variable

The plot shows how strongly imbalanced the dataset is. It could therefore be useful to compute class weights and pass them as an argument during training.

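It is clear that most of our texts are clean. Note that the freq column in Table 2 is normalised by the sum of all label counts, and a comment can carry several labels: clean accounts for 0.8033 of the label occurrences, while the six toxic labels together account for 0.1967. Relative to the 159571 comments in the dataset, the 143346 clean ones are about 0.898 of the observations.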
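
A quick sketch of the two possible normalisations, using the counts from Table 2 and the total number of comments stored in config.total_samples. The variable names are mine and only serve the example.

```python
# Counts copied from Table 2.
counts = {
    "clean": 143346,
    "identity_hate": 1405,
    "insult": 7877,
    "obscene": 8449,
    "severe_toxic": 1595,
    "threat": 478,
    "toxic": 15294,
}
total_comments = 159571  # config.total_samples

# A comment can carry several labels, so this sum exceeds the number of comments.
total_label_occurrences = sum(counts.values())

share_of_labels = counts["clean"] / total_label_occurrences
share_of_comments = counts["clean"] / total_comments

print(total_label_occurrences)      # 178444
print(round(share_of_labels, 4))    # 0.8033
print(round(share_of_comments, 4))  # 0.8983
```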

3.2 Sequence length definition

To convert the text into a useful input for a NN, it is necessary to use a TextVectorization layer. See Section 4.

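One of its arguments is output_sequence_length: to choose it, it is useful to analyze the length of our texts. To simulate what the layer will do, we lowercase the comments and remove punctuation and new lines. Note that str_count() counts characters, while output_sequence_length is expressed in tokens, so a value derived this way is a generous upper bound on the number of tokens per comment.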
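
To make the difference between the two length measures concrete, here is a small Python sketch of the same cleaning steps. It assumes ASCII punctuation only (the R class [[:punct:]] may cover more characters), and the example comment is made up.

```python
import re
import string


def standardize(text):
    """Lowercase, drop ASCII punctuation, and turn newlines into spaces."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    return text.replace("\n", " ")


comment = "Explanation\nWhy the edits were reverted?"  # made-up example
clean = standardize(comment)

n_characters = len(clean)      # what str_count() measures
n_tokens = len(clean.split())  # what a whitespace tokenizer would produce

print(clean)                   # explanation why the edits were reverted
print(n_characters, n_tokens)  # 39 6
```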

3.2.1 Summary

Code
library(reticulate)
df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
    ) %>% 
  pull(text_length) %>% 
  summary() %>% 
  as.list() %>% 
  as_tibble()
Table 3: Summary of text length
# A tibble: 1 × 6
   Min. `1st Qu.` Median  Mean `3rd Qu.`  Max.
  <dbl>     <dbl>  <dbl> <dbl>     <dbl> <dbl>
1     4        91    196  378.       419  5000

3.2.2 Boxplot

Code
library(reticulate)
boxplot <- df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
    ) %>% 
  # pull(text_length) %>% 
  ggplot(aes(y = text_length)) +
  geom_boxplot() +
  theme_minimal()
ggplotly(boxplot)
Figure 2: Text length boxplot

3.2.3 Histogram

Code
library(reticulate)
df_ <- df_r %>% 
  mutate(
    comment_text_clean = comment_text %>%
      tolower() %>% 
      str_remove_all("[[:punct:]]") %>% 
      str_replace_all("\n", " "),
    text_length = comment_text_clean %>% str_count()
  )

Q1 <- quantile(df_$text_length, 0.25)
Q3 <- quantile(df_$text_length, 0.75)
IQR <- Q3 - Q1
upper_fence <- as.integer(Q3 + 1.5 * IQR)

histogram <- df_ %>% 
  ggplot(aes(x = text_length)) +
  geom_histogram(bins = 50) +
  geom_vline(aes(xintercept = upper_fence), color = "red", linetype = "dashed", linewidth = 1) +
  theme_minimal() +
  xlab("Text Length") +
  ylab("Frequency") +
  xlim(0, max(df_$text_length, upper_fence))
ggplotly(histogram)
Figure 3: Text length histogram with boxplot upper fence

Considering the analysis above, I think a good starting value for output_sequence_length is 911, the upper fence of the boxplot, shown as the dashed red vertical line in the last plot. Doing so, we truncate only the outliers, which are a small part of our dataset.
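
The value 911 can be checked by hand from the quartiles in Table 3, using the same formula as the R code above (Q3 + 1.5 * IQR). A minimal sketch:

```python
q1, q3 = 91, 419  # 1st and 3rd quartile from Table 3
iqr = q3 - q1     # interquartile range: 328

# Upper fence of the boxplot, truncated to an integer as in the R code.
upper_fence = int(q3 + 1.5 * iqr)

print(upper_fence)  # 911
```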

3.3 Dataset

Now we can split the dataset into 3 sets: train, test and validation. Since sklearn has no function that splits into these 3 sets directly, we can do the following:

  • split between a train and a temporary set with a 0.3 split.
  • split the temporary set into 2 equally sized test and val sets.

Code
x = df[config.features].values
y = df[config.labels].values

xtrain, xtemp, ytrain, ytemp = train_test_split(
  x,
  y,
  test_size=config.temp_split, # .3
  random_state=config.random_state
  )
xtest, xval, ytest, yval = train_test_split(
  xtemp,
  ytemp,
  test_size=config.test_split, # .5
  random_state=config.random_state
  )

  • xtrain shape: py$xtrain.shape
  • ytrain shape: py$ytrain.shape
  • xtest shape: py$xtest.shape
  • ytest shape: py$ytest.shape
  • xval shape: py$xval.shape
  • yval shape: py$yval.shape
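
As a sanity check, the expected number of rows in each set can be worked out without running the split. The sketch assumes that train_test_split rounds the held-out size up when test_size is a fraction; the results match train_samples and val_samples in Config.

```python
import math

n_total = 159571                     # config.total_samples

# First split: 30% goes to the temporary set (assumed to be rounded up).
n_temp = math.ceil(0.3 * n_total)
n_train = n_total - n_temp

# Second split: the temporary set is halved into test and validation.
n_val = math.ceil(0.5 * n_temp)
n_test = n_temp - n_val

print(n_train, n_test, n_val)  # 111699 23936 23936
```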

The datasets are created with the tf.data.Dataset API, which builds a data input pipeline. The tf.data API makes it possible to handle large amounts of data, read from different data formats, and perform complex transformations. A tf.data.Dataset is an abstraction that represents a sequence of elements, in which each element consists of one or more components. Here each dataset is created using from_tensor_slices, which builds a tf.data.Dataset from a tuple (features, labels). .batch lets us work in batches to improve performance, while .prefetch overlaps the preprocessing and model execution of a training step: while the model is executing training step s, the input pipeline is reading the data for step s+1. Check the documentation for further information.

Code
train_ds = (
    tf.data.Dataset
    .from_tensor_slices((xtrain, ytrain))
    .shuffle(xtrain.shape[0])
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

test_ds = (
    tf.data.Dataset
    .from_tensor_slices((xtest, ytest))
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

val_ds = (
    tf.data.Dataset
    .from_tensor_slices((xval, yval))
    .batch(config.batch_size)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
Code
print(
  f"train_ds cardinality: {train_ds.cardinality()}\n",
  f"val_ds cardinality: {val_ds.cardinality()}\n",
  f"test_ds cardinality: {test_ds.cardinality()}\n"
  )
train_ds cardinality: 3491
 val_ds cardinality: 748
 test_ds cardinality: 748
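
These cardinalities are the number of batches, not the number of rows. Since .batch keeps the last, smaller batch by default, each value is the row count divided by the batch size, rounded up. A small sketch using the sample sizes from Config (the test size is assumed equal to the validation size, as the split halves the temporary set):

```python
import math

batch_size = 32  # config.batch_size
rows = {"train": 111699, "val": 23936, "test": 23936}

# Number of batches per dataset, counting the final partial batch.
batches = {name: math.ceil(n / batch_size) for name, n in rows.items()}

print(batches)  # {'train': 3491, 'val': 748, 'test': 748}
```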

Check the first element of the dataset to be sure that the preprocessing is done correctly.

Code
train_ds.as_numpy_iterator().next()
(array([b"Meivazhi\nI've had a go at restarting the Meivazhi article in a style that's more standard for Wikipedia articles. Someone would probably have deleted it pretty quickly if it had stayed in the form you posted. I'd be grateful if you could help at Talk:Meivazhi about the accuracy. \n\nUnfortunately I had to remove your links. The conflict of interest guidelines advise against editors linking to their own sites, and also Wikipedia's attribution policy WP:ATT requires that information should come from third-party published sources rather than personal websites. Do you know of any good newspaper/book accounts of Meivazhi?  \n\nPS What is the salaimanimudi.indlist.com site? A personal site by a member?",
       b'Very well, thanks for the clarification.',
       b'"\n\nHi Emopunkundead13, and Welcome to Wikipedia!  \nWelcome to Wikipedia! I hope you enjoy the encyclopedia and want to stay. As a first step, you may wish to read the Introduction.\n\nIf you have any questions, feel free to ask me at my talk page \xe2\x80\x94 I\'m happy to help. Or, you can ask your question at the New contributors\' help page.\n\n  \nHere are some more resources to help you as you explore and contribute to the world\'s largest encyclopedia...\n\n Finding your way around: \n\n Table of Contents\n\n Department directory\n\n Need help? \n\n Questions \xe2\x80\x94 a guide on where to ask questions.\n Cheatsheet \xe2\x80\x94 quick reference on Wikipedia\'s mark-up codes.\n\n Wikipedia\'s 5 pillars \xe2\x80\x94 an overview of Wikipedia\'s foundations\n The Simplified Ruleset \xe2\x80\x94 a summary of Wikipedia\'s most important rules.\n\n How you can help: \n\n Contributing to Wikipedia \xe2\x80\x94 a guide on how you can help.\n\n Community Portal \xe2\x80\x94 Wikipedia\'s hub of activity.\n\n Additional tips...  \n\n Please sign your messages on talk pages with four tildes (~~~~). This will automatically insert your ""signature"" (your username and a date stamp). The  button, on the tool bar above Wikipedia\'s text editing window, also does this. \n\n If you would like to play around with your new Wiki skills the Sandbox is for you. \n\n Good luck, and have fun. \n"',
       b'the entire human race is quite impressive',
       b'"\n\n{{unblock|1=\nRuslik0 is incorrect in his block of me. I wasn\'t edit waring as defined by Wikipedia: I made one reversion of an edit today.\n\nFurthermore, Ruslik0 is incorrect that I made a false accusation regarding the objective fact that Jeffro77 is engaging in vandalism: Jeffro77 replaced this entrywhich is very similar to the version that existed there since October 31, 2008, with some improvementswith this entry, giving the excuse in his edit summary of ""WP:FRINGE,"" which doesn\'t even make sense as an explanation for his edits: i.e., deletion of a number of citations; deletion of the information on theologian Prof. Wolfhart Pannenberg\'s defense of the theology of the Omega Point Theory; etc. As well, there\'s no need for the large displayed quote, as the previous entry already stated that Prof. David Deutsch doesn\'t agree that the Omega Point is God; furthermore, Jeffro77\'s edit deletes the mention of the fact that Prof. Deutsch endorses the physics of the Omega Point Theory.\n\nIn addtion, Jeffro77\'s edit isn\'t even literate, as he give the following mangled citation to Prof. Deutsch: ""Chapter 14: ""The Ends of the Universe,"" with additional comments by Frank J. Tipler; also available here"". Whereas the version before was properly cited: ""David Deutsch, The Fabric of Reality: The Science of Parallel Universes\xe2\x80\x94and Its Implications (London: Allen Lane The Penguin Press, 1997), ISBN 0713990619. Extracts from Chapter 14: ""The Ends of the Universe,"" with additional comments by Frank J. Tipler; also available here and here.""\n\nAs stated above, Jeffro77\'s excuse in his edit summary doesn\'t even make sense, as it doesn\'t explain why he would delete mention of Prof. 
Deutsch\'s endorsement of the physics of the Omega Point Theory, particularly since Jeffro77 himself called Deutsch an ""eminent physicist"" in his own edit (i.e., that statement wasn\'t there before): of which argues against the notion that Jeffro77 considers the physics of the Omega Point Theory as fringe. Further, Jeffro77 deleted mention of the fact that Prof. Wolfhart Pannenberg, who is one of the leading theologians in the world, has defended the theology of the Omega Point Theory and Tipler\'s position that the Omega Point is consistent with the Judeo-Christian God.\n\nAdditionally, while Jeffro77\'s ""fringe"" claim\'s aren\'t even relevant to his edit, they have already been refuted numerous times. Indeed, Jeffro77 himself refutes this claim in this very edit of his: to state again, therein Jeffro77 himself called Prof. Deutsch an ""eminent physicist"" in his own words. So obviously Jeffro77 himself must consider Deutsch\'s endorsement of the physics of the Omega Point Theory to be noteworthy, and yet he deleted this endoresement in an area where Jeffro77 himself agrees that Deutsch is eminently qualified and replaced it with a large displayed quotation regarding a matter that Deutsch has no qualification or erudition in, even though the previous version already clearly mentioned that Deutsch disagrees that Omega Point is God. Moreover, in this edit Jeffro77 deletes all mention of the fact that an actual trained theologian, Prof. Wolfhart Pannenberg, who is one of the world\'s leading theologians, has defended the theology of the Omega Point Theory.\n\nAs well, Prof. Tipler himself has defended the theology of the Omega Point Theory and his identification of the Omega Point as being God in a peer-reviewed academic journal: see Frank J. Tipler, ""The Omega Point as Eschaton: Answers to Pannenberg\'s Questions for Scientists,"" Zygon: Journal of Religion & Science, Vol. 24, Issue 2 (June 1989), pp. 217-253. Regarding the physics, Prof. 
Tipler has published his Omega Point Theory in many peer-reviewed science journals, including a number of the leading physics journals such as Reports on Progress in Physics (one of the world\'s leading physics journals) and Monthly Notices of the Royal Astronomical Society (one of the world\'s leading astrophysics journals). The Wikipedia article on the Omega Point Theory lists seven different mainstream peer-reviewed scientific journals in which Tipler\'s Omega Point Theory has been published (and that\'s not including the Zygon journal). That is quite a significant amount.\n\nJeffro77 is out of',
       b'come on dont u have the balls to block me',
       b'Genealogists routinely make the mistake of assuming that the father-of-record is necessarily the biological father.  However, this is not always the case in polygamous Mormon culture, as wives are officially and unofficially shifted from man to man.  Even sisters have been traded.  We will never really know the full extent of what went on in the Mexican Mormon colony the Romneys sprang fromof course, that was precisely why the left the United States to begin with.  It makes sense, of course, that Romney and his political operatives would want to keep these issues from the American people.',
       b"Then I will make a copyright dispute, which could have been avoided had you given me five minutes to complete my comments on the talk page.  I can prove conclusively the text is mine (I coined the term 'infinity snake' myself and use it to find plagiarism) so I'm not that concerned about it.  You however seem to have a personal issue with me (or a simple power trip) that you're acting on very in appropriately, and if you can't have a mature discussion, I'm going to have no choice but to request arbitration.",
       b'You live in Nashville, right?',
       b"2010 (UTC)\n\nSeriously, fuck this guy's fucking face.  I am so fucking sick of fucking seeing it every fucking time I want to fucking read the fucking article about fucking boobs on the fucking internet. 160.39.54.179  02:10, 23 December",
       b"I'd just keep quiet if I were you.",
       b'can u tell me reasons why this person is notabl;e but B. S. Sahay is.',
       b'Secord\nDo you have any source at all about this guy? What is Captol Records #776 Union Session? If I google it nothing comes up at all.',
       b'REDIRECT Talk:The Fast and the Furious: Tokyo Drift (Original Motion Picture Score)',
       b"But I LOVE pussy!  Just kidding.  I understand what you mean.  And it's okay, becuase I'm sure everyone else knows about Flewis.  He actually is bragging on his page about ge3tting someone blocked.  Man, he might have the right to do that, but bragging about it is like such a little kid I'm ebarassed for him",
       b"Toronto'''\n\nHamilton, Toronto,and Oshawa are now tethered together with the criteria set forth in the article. As such, Toronto, Hamilton and Oshawa form I contiguous urban area, and reaches a pop near 6 mil. Check google earth if you have any doubts.",
       b'"\nThat does not make it yours; when you hit ""save"" you automatically surrender all rights to it. On your second point, your participation here is a privilege, not a right.  > hane\xca\xbc "',
       b'Ill be back in a little later then Ill have a minute to type it out . 68.39.152.45',
       b'"\n\nI\'m not defending or advocating anything.  It\'s about notability.  ""What makes Wikipediocracy notable?""  Editors are trying to keep the most visible and notable practice out of the lede for reasons that seem to be defensive about what should go there.  Why do we need to tiptoe around it?  It is what is.  Calling them pedophiles would cross the line without reliable sources and that doesn\'t seem to be supported and is, of course, a crime.  There should be literally no resistance to pointing out their main avenue of notability -> which is investigating and exposing editors they believe are doing harm.  That\'s what they do.  That\'s what they are known for.  I\'m not morally weighing whether they are right or wrong, just stating what is the elephant in the room.    "',
       b'"\n\n Please do not remove all content from pages without explanation, as you did with this edit to Witch and Wizard. If you continue to do so, you will be blocked from editing.  \xe2\x80\xa2talk\xe2\x80\xa2trib "',
       b"The origin of the Pearl of Great Price \n\nI recently added a paragraph on the origin of the Pearl of Great Price.  The paragraph was reverted.  Before we get into a reversion war, I would like to know what the problem is with my paragraph.  I'm happy to put it wherever deemed appropriate.  But the origin should be noted, correct.  Also, the fact that it was, according to scholars, mis-translated should also be noted, correct?  I think these facts are important to the article.",
       b"YOU'RE RETARDED!=\n\nSCREW YOU FOR DELETING MY ARTICLE! YOU'RE RETARDED! GO KILL YOURSELF IN A BARREL FULL OF SHIT!",
       b'"==Image source problem with Image:Visuel2.jpg==\n\nThanks for uploading Image:Visuel2.jpg. I noticed that the file\'s description page currently doesn\'t specify who created the content, so the copyright status is unclear. If you did not create this file yourself, you will need to specify the owner of the copyright. If you obtained it from a website, then a link to the website from which it was taken, together with a restatement of that website\'s terms of use of its content, is usually sufficient information. However, if the copyright holder is different from the website\'s publisher, their copyright should also be acknowledged.\n\nAs well as adding the source, please add a proper copyright licensing tag if the file doesn\'t have one already. If you created/took the picture, audio, or video then the  tag can be used to release it under the GFDL. If you believe the media meets the criteria at Wikipedia:Non-free content, use a tag such as  or one of the other tags listed at Wikipedia:Image copyright tags#Fair use. See Wikipedia:Image copyright tags for the full list of copyright tags that you can use.\n\nIf you have uploaded other files, consider checking that you have specified their source and tagged them, too. You can find a list of files you have uploaded by following this link. Unsourced and untagged images may be deleted one week after they have been tagged, as described on criteria for speedy deletion. If the image is copyrighted under a non-free license (per Wikipedia:Fair use) then the image will be deleted 48 hours after . If you have any questions please ask them at the Media copyright questions page. Thank you.  (Talk, ) "',
       b'Also, he is a bad person',
       b"SSSSSSSSSSSSShhhhhhhh \n\nGo away. Don't use my suerpage again for any reason or you will regret it.",
       b"Photos\nHow come are only 5 pictures with Madonna?Others articles have more pictures.Don't you think that a photo from Blond Ambition Tour,and a screen shot from Like a virgin;Frozen or Hung up videos should be added?Thank you!thesweetlamb.",
       b"Hypocricy and Jealousy\nI want to let everyone here know that the author deleted my articles when I attempted to create new ones. This is due to the author's insecurites and racist views about the accuracy of my info. I know that many of you feel the same. But don't worry. We'll definitely get around this and the racist author.",
       b'fair enough\n\nits a fair point...... but why oh why did i do so much fulfilling all of the wikpedia guidlines for personal content , editting style , correct sourcing and referencing , wiki formatting etc .....months tireless work on an article which was to be so casually consigned to the dustbin....... \n\nhowever much i respect your views, a redirect to the clairvoyance article  would however be entirely meaningless... one is not exchangeable for the other...  like cheese isn`t milk etc.....\n\ncovering the groundwork general background to establish a context for understanding extra sensory phenomena in general needed to be done  and it might as well happen at the Clairsentience page for now.....  \n\nthe reason i did it there was because fundamental doubts were being expressed about my earliest article`s conents were concerned implicitly announcing that the background for understanding extra sensory phenomena in general had not been done.... which was  ... as remember saying to you a few weeks ago   the frustrating context which motivated the writing of the second article with all of its references to the background research of brennan , lylle , mckenna , bohm , wilber , sheldrake   etc......because this background had to be established before any specifics about clairsentience could even be begun to be aproached  ....... thus the many weeks days and months of toil which has been endured to even get this background matereal into a wikpedia format and guidlines shape  ......  all of which criterea have been met  .... or at leasst  were until the finished product was mindlessly deleted...\n\nthe new additional altered states matereal  was a tentative beginning into finding a context for tentatively describing the specifics of the clairsentience phenomena itself...\n\nbut , agian it was trashed with no thought or care....',
       b'"\n\nPlease replace the deleted page ""A Veritable Smorgasboard"". Deleting it was completely unnecessary. Honestly, do you think Wikipdedia can\'t handle another 150kB page? I would fix this for you, but I do not know my way around Wikipedia\'s syntax.\n(Apologies in advance for my deficient signature) - Fiantres"',
       b"hahaha, nice try.  but they don't agree with your ethnocentric bullshit.  try that on for size!   17:08, 16 Jun 2005 (UTC)",
       b'Oppose There is no doubt the Nazi used Gun Control as a means of suppression. To deny that is idiotic or agenda promoting. The argument that the right to bear arms is a deterent to suppression is very strong and logical. It seems logic and political agenda pushers often diverge. Suppressed people do not have the right to bear arms. Those who seek to suppress people for their political purposes seek to suppress the right to bear arms. 172.56.11.104',
       b'"\n\n Pozdrav \n\nPozdrav Direktore, ovo je tekst koji sam stavio odmah iza tvog odgovora na diskusionoj strani od clanka ""Differences between standard Serbian/Croatian/Bosnian language"". Pozdrav tebi i sve najbolje:\nTotally agree with you Direktor, and I am sure that your highly reasonable opinions and facts you are presenting on wikipedia are highly respected and accepted by the waste majority of its readers. Please, if you can, - do your regular check-ups of the articles concerning South Slavic languages and make sure they\xe2\x80\x99re not presenting some misleading information. I would just add to this, that although \xe2\x80\x98the political reason is the officially maintained distaste of the so-called ""pan-Yugoslav commonness\xe2\x80\x9d\xe2\x80\x99, this distaste should be \xe2\x80\x98expressed\xe2\x80\x99 in a more civilized manner than creating some non-acceptable partial maps showing only a half of the Shtokavian speaking area, or trying to camouflage the factual state of the close ties within the Central South Slavic system (language) at the templates featured in the articles about South Slavic languages and dialects. And most importantly - that ridiculous \xe2\x80\x98pan-Yugo\xe2\x80\x99 distaste which openly sends a message of hatred and separatism, should not, by any mean be reflected in the language area, because it only shows how low and how uncivilized its \xe2\x80\x98supporters\xe2\x80\x99 can be, no matter what \xe2\x80\x98not-supported-by anyone-in-scientific-world\xe2\x80\x99 theories they may point out as their sources. \nAnd, at the end, as an example of a civilized political distaste between nations, here\xe2\x80\x99s the example of Americans and Canadians. Majority of Canadians feel in a different extent, a kind of aversion to Americans, especially since the start of the \xe2\x80\x98George-Bush-era\xe2\x80\x99 in USA. 
But still, nobody says that Canadians speak a \xe2\x80\x98different language\xe2\x80\x99 than Americans, and nobody tries to hide the common history facts and the strong cultural and economical ties that exist between these 2 countries. This is a typical way of overcoming any kind of \xe2\x80\x98distaste\xe2\x80\x99 in a human and civilized way, all the other ways are just a shame for humanity.\nBest Regards to you Direktor and to your beautiful and cosmopolitan Split and Dalmatia. Best Cheerful Greetings;"'],
      dtype=object), array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]]))

We also check the shapes. We expect a feature of shape (batch,) and a target of shape (batch, number of labels).

Code
# fetch a single batch once instead of creating a new iterator for every line
text_batch, label_batch = next(train_ds.as_numpy_iterator())
print(
  f"text train shape: {text_batch.shape}\n",
  f" text train type: {text_batch.dtype}\n",
  f"label train shape: {label_batch.shape}\n",
  f"label train type: {label_batch.dtype}\n"
  )
text train shape: (32,)
  text train type: object
 label train shape: (32, 6)
 label train type: int64

4 Preprocessing

Raw text is not an input a neural network can handle directly, so preprocessing is required. The TextVectorization layer is meant to handle natural language inputs. The processing of each example contains the following steps:

  1. Standardize each example (usually lowercasing + punctuation stripping).
  2. Split each example into substrings (usually words).
  3. Recombine substrings into tokens (usually ngrams).
  4. Index tokens (associate a unique int value with each token).
  5. Transform each example using this index, either into a vector of ints or a dense float vector.

For more details, see the Keras TextVectorization documentation.
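
To make the steps above concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: the helper names (standardize, build_vocabulary, vectorize) and the toy corpus are made up for illustration, and it assumes index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.

```python
import string
from collections import Counter


def standardize(text):
    # Step 1: lowercase and strip ASCII punctuation
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocabulary(texts, max_tokens):
    # Steps 2-4: split on whitespace, count tokens, index the most frequent ones.
    # Two indices are reserved (0 = padding, 1 = out-of-vocabulary),
    # so only max_tokens - 2 real tokens are kept.
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    kept = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(kept)}


def vectorize(text, vocab, output_sequence_length):
    # Step 5: map tokens to ints, then truncate or right-pad with 0
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))


# Toy corpus (hypothetical data)
corpus = ["You are great!", "You are not great, you are awful."]
vocab = build_vocabulary(corpus, max_tokens=5)  # keeps "you", "are", "great"
encoded = vectorize("You are AWFUL, truly.", vocab, output_sequence_length=6)
print(vocab)
print(encoded)
```

The two unknown words ("awful" did not make the cut, "truly" was never seen) both map to index 1, and the sequence is padded with zeros up to the requested length.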

Code
text_vectorization = TextVectorization(
  max_tokens=config.max_tokens,
  standardize="lower_and_strip_punctuation",
  split="whitespace",
  output_mode="int",
  output_sequence_length=config.output_sequence_length,
  pad_to_max_tokens=True
  )

# prepare a dataset that only yields raw text inputs (no labels)
text_train_ds = train_ds.map(lambda x, y: x)
# adapt the text vectorization layer to the text data to index the dataset vocabulary
text_vectorization.adapt(text_train_ds)

This layer is set to:

  • max_tokens: 20000, a common value for text classification. It is the maximum size of the vocabulary for this layer.
  • output_sequence_length: 911. See Figure 3 for the reason why. Only valid in "int" mode.
  • output_mode: "int", which outputs integer indices, one integer index per split string token. When output_mode == "int", 0 is reserved for masked locations; this reduces the vocab size to max_tokens - 2 instead of max_tokens - 1.
  • standardize: "lower_and_strip_punctuation".
  • split: on whitespace.

To keep the original comments as text and also have a tf.data.Dataset whose text is preprocessed by the TextVectorization layer, the layer can be mapped over the features of each dataset.

Code
# tf.data.AUTOTUNE is the current name of tf.data.experimental.AUTOTUNE
processed_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.AUTOTUNE
)
processed_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.AUTOTUNE
)
processed_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=tf.data.AUTOTUNE
)

5 Model

5.1 Definition

Define the model using the Functional API.

Code
# LayerNormalization is used below but is missing from the imports in section 2.2
from keras.layers import LayerNormalization

def get_deeper_lstm_model():
    # Reset the Keras global state before building a new model
    clear_session()
    inputs = Input(shape=(None,), dtype=tf.int64, name="inputs")
    # mask_zero=True marks the padding index 0 so downstream layers can ignore it
    embedding = Embedding(
        input_dim=config.max_tokens,
        output_dim=config.embedding_dim,
        mask_zero=True,
        name="embedding"
    )(inputs)
    # Two stacked bidirectional LSTMs, both returning the full sequence
    x = Bidirectional(LSTM(256, return_sequences=True, name="bilstm_1"))(embedding)
    x = Bidirectional(LSTM(128, return_sequences=True, name="bilstm_2"))(x)
    # Global average pooling
    x = GlobalAveragePooling1D()(x)
    # Add regularization
    x = Dropout(0.3)(x)
    x = Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = LayerNormalization()(x)
    # One sigmoid unit per label: independent probabilities for the multilabel output
    outputs = Dense(len(config.labels), activation='sigmoid', name="outputs")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer='adam', loss="binary_crossentropy", metrics=config.metrics, steps_per_execution=32)

    return model

lstm_model = get_deeper_lstm_model()
lstm_model.summary()

5.2 Callbacks

Finally, the model has been trained using 2 callbacks:

  • Early Stopping, to avoid wasting Kaggle GPU time.
  • Model Checkpoint, to retrieve the best model training information.

Code
# callbacks
my_es = config.get_early_stopping()
my_mc = config.get_model_checkpoint(filepath="/checkpoint.keras")
callbacks = [my_es, my_mc]

5.3 Final preparation before fit

Because the dataset is imbalanced, a weight is computed for each class and passed to the model during training, with the aim of improving performance. Here the weight of a label is the share of training observations in which that label is positive.

Code
lab = pd.DataFrame(columns=config.labels, data=ytrain)
r = lab.sum() / len(ytrain)
class_weight = dict(zip(range(len(config.labels)), r))
df_class_weight = pd.DataFrame.from_dict(
  data=class_weight,
  orient='index',
  columns=['class_weight']
  )
df_class_weight.index = config.labels
Code
library(reticulate)
py$df_class_weight
Table 4: Class weight
              class_weight
toxic          0.095900590
severe_toxic   0.009928468
obscene        0.052757858
threat         0.003061800
insult         0.049132042
identity_hate  0.008710911
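
As a worked example of this computation, the sketch below derives per-label frequencies from a small made-up multi-hot matrix. The data is hypothetical. For comparison only, it also shows inverse-frequency weights, a common alternative that gives rarer labels larger weights; that variant is an assumption of mine and is not what was used to train the model in this post.

```python
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy multi-hot targets: one row per comment, one column per label (made-up data)
y = [
    [0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
n = len(y)

# Number of positive rows per label
positives = [sum(row[j] for row in y) for j in range(len(labels))]

# Same quantity as in the post: share of positive rows per label
frequency = {j: positives[j] / n for j in range(len(labels))}

# Hypothetical alternative (not used in the post): inverse frequency
inverse = {j: n / positives[j] for j in range(len(labels))}

for j, name in enumerate(labels):
    print(f"{name:14s} frequency={frequency[j]:.3f} inverse={inverse[j]:.3f}")
```

With frequency weights the most common label (toxic) receives the largest weight, while with inverse-frequency weights the rarest labels do.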

It is also useful to define the steps per epoch for the train and validation datasets. This step is required to avoid exhausting the dataset during the fit, which happened to me.

Code
steps_per_epoch = config.train_samples // config.batch_size
validation_steps = config.val_samples // config.batch_size
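
As a quick check of the floor division above, here is a minimal sketch with a hypothetical sample count (1000 is made up and is not the real dataset size; the batch size of 32 matches the batch shape printed earlier).

```python
import math

batch_size = 32
train_samples = 1000  # hypothetical count, for illustration only

# Floor division counts only the full batches
steps_per_epoch = train_samples // batch_size

# Samples that do not fill a complete batch
leftover = train_samples % batch_size

# Number of batches needed to see every sample exactly once
batches_for_full_pass = math.ceil(train_samples / batch_size)

print(steps_per_epoch, leftover, batches_for_full_pass)
```

With floor division an epoch stops one batch short of a full pass whenever the sample count is not a multiple of the batch size. Assuming the repeated dataset simply keeps iterating, the leftover samples are picked up at the start of the next epoch.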

5.4 Fit

The fit has been done on Kaggle to leverage the GPU. Some considerations about the model:

  • .repeat() restarts the dataset once it is exhausted, so the model keeps seeing all of the data across epochs.
  • epochs is set to 100.
  • validation_data has the same repeat.
  • callbacks are the ones defined before.
  • class_weight passes the per-class frequencies computed above to the training, because our dataset is imbalanced.
  • steps_per_epoch and validation_steps are needed because of the use of repeat.
Code
history = lstm_model.fit(
  processed_train_ds.repeat(),
  epochs=config.epochs,
  validation_data=processed_val_ds.repeat(),
  callbacks=callbacks,
  class_weight=class_weight,
  steps_per_epoch=steps_per_epoch,
  validation_steps=validation_steps
  )

Now we can import the model and the history trained on Kaggle.

Code
model = load_model(filepath=config.model)
history = pd.read_excel(config.history)

5.5 Evaluate

Code

validation = model.evaluate(
  processed_val_ds.repeat(),
  steps=validation_steps, # 748
  verbose=0
  )
Code
val_metrics <- tibble(
  metric = c("loss", "precision", "recall", "auc", "f1_score"),
  value = py$validation
  )
val_metrics
Table 5: Model validation metric
# A tibble: 5 × 2
  metric     value
  <chr>      <dbl>
1 loss      0.0542
2 precision 0.789 
3 recall    0.671 
4 auc       0.957 
5 f1_score  0.0293

5.6 Predict

For prediction, the dataset does not need to be repeated, because the model has already been trained on all of the training data. It only has to consume the new data once to make the predictions.

Code

predictions = model.predict(processed_test_ds, verbose=0)

5.7 Confusion Matrix

The best way to assess the performance of a multi label classifier is to use a confusion matrix. Scikit-learn has a specific function, multilabel_confusion_matrix, which builds one 2×2 matrix per label to handle the fact that there can be multiple labels for one prediction.
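
To show what "one matrix per label" means, here is a small pure-Python sketch on made-up data. It follows the [[TN, FP], [FN, TP]] layout documented for scikit-learn's multilabel_confusion_matrix, but the function itself (multilabel_confusion) is a hypothetical stand-in, not the library code.

```python
def multilabel_confusion(y_true, y_pred):
    """Return one [[TN, FP], [FN, TP]] matrix per label column."""
    n_labels = len(y_true[0])
    matrices = []
    for j in range(n_labels):
        tn = fp = fn = tp = 0
        for t_row, p_row in zip(y_true, y_pred):
            t, p = t_row[j], p_row[j]
            if t == 1 and p == 1:
                tp += 1
            elif t == 1:
                fn += 1
            elif p == 1:
                fp += 1
            else:
                tn += 1
        matrices.append([[tn, fp], [fn, tp]])
    return matrices


# Toy targets and predictions for 3 labels (hypothetical data)
y_true = [[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]

mcm = multilabel_confusion(y_true, y_pred)
for j, matrix in enumerate(mcm):
    print(f"label {j}: {matrix}")
```

Each label is treated as its own binary problem, so an observation with several positive labels contributes to several matrices at once.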

5.7.1 Grid Search Cross Validation for best threshold

Grid Search CV is a technique for fine-tuning the hyperparameters of a ML model. It systematically searches through a set of hyperparameter values to find the combination which leads to the best model performance. In this case, I am using KFold Cross Validation, a resampling technique that splits the data into k consecutive folds. Each fold is used once as validation set, while the k - 1 remaining folds form the training set. See the documentation for more information.

The model is tuned to favor recall. The decision was made because the cost of missing a True Positive is greater than the cost of a False Positive. In this case, missing an injurious observation is worse than classifying a clean one as bad.
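
The sketch below illustrates one way such a fold-based threshold search could look. It is hypothetical: it is not the find_optimal_threshold_cv helper used in this post, whose code is not shown here. The names search_threshold and recall_macro, the threshold grid, and the toy data are all assumptions. Since the predicted probabilities are fixed, nothing is fitted on the training folds; the sketch only averages the score over the held-out folds.

```python
def recall_macro(y_true, y_pred):
    # Mean of per-label recall; labels without positives in the fold are skipped
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        pos = sum(t[j] for t in y_true)
        if pos:
            scores.append(tp / pos)
    return sum(scores) / len(scores) if scores else 0.0


def search_threshold(y_true, y_proba, score_fn, thresholds, n_splits=3):
    # Split the row indices into n_splits consecutive folds (no shuffling)
    n = len(y_true)
    folds = [
        range((i * n) // n_splits, ((i + 1) * n) // n_splits)
        for i in range(n_splits)
    ]
    best_threshold, best_score = None, float("-inf")
    for thr in thresholds:
        fold_scores = []
        for fold in folds:
            yt = [y_true[i] for i in fold]
            # Binarize the probabilities of this fold with the candidate threshold
            yp = [[int(p >= thr) for p in y_proba[i]] for i in fold]
            fold_scores.append(score_fn(yt, yp))
        mean_score = sum(fold_scores) / len(fold_scores)
        # Strict comparison: on ties the first (lowest) threshold is kept
        if mean_score > best_score:
            best_threshold, best_score = thr, mean_score
    return best_threshold, best_score


# Toy data for 2 labels (hypothetical)
y_true = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]]
y_proba = [[0.9, 0.2], [0.1, 0.6], [0.3, 0.8], [0.2, 0.1], [0.7, 0.4], [0.05, 0.35]]

best_threshold, best_score = search_threshold(
    y_true, y_proba, recall_macro, thresholds=[0.25, 0.5, 0.75]
)
print(best_threshold, best_score)
```

Because recall can only go down as the threshold rises, a search scored on recall alone tends to select the lowest threshold in the grid.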

5.7.2 Confidence threshold and Precision-Recall trade off

While the KFold grid search CV technique is useful for testing multiple hyperparameter values, it is important to understand the problem we are facing. A multi label deep learning classifier outputs a vector of per-class probabilities. These need to be converted to a binary vector using a confidence threshold.

  • The higher the threshold, the fewer classes the model predicts, increasing model confidence [higher Precision] and increasing missed classes [lower Recall].
  • The lower the threshold, the more classes the model predicts, decreasing model confidence [lower Precision] and decreasing missed classes [higher Recall].
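
The trade-off described in the two points above can be seen on a tiny made-up example. The probabilities, the targets and the two thresholds below are hypothetical, and precision and recall are computed over all labels at once for brevity.

```python
def binarize(probabilities, threshold):
    # Convert per-class probabilities into a 0/1 vector per observation
    return [[int(p >= threshold) for p in row] for row in probabilities]


def precision_recall(y_true, y_pred):
    # Counts are pooled over every observation and label
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif p == 1:
                fp += 1
            elif t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy targets and predicted probabilities for 3 labels (hypothetical data)
y_true = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
y_proba = [
    [0.80, 0.10, 0.06],
    [0.20, 0.02, 0.01],
    [0.60, 0.12, 0.03],
    [0.07, 0.01, 0.40],
]

p_high, r_high = precision_recall(y_true, binarize(y_proba, 0.5))
p_low, r_low = precision_recall(y_true, binarize(y_proba, 0.05))
print(f"threshold 0.50: precision={p_high:.2f} recall={r_high:.2f}")
print(f"threshold 0.05: precision={p_low:.2f} recall={r_low:.2f}")
```

On this toy data the high threshold makes no false alarms but misses half of the positive labels, while the low threshold recovers all of them at the price of as many false positives as true positives.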

Threshold selection means we have to decide which metric to prioritize, based on the problem we are facing and the relative cost of misjudging. We can consider toxic comment filtering a problem similar to cancer diagnosis: it is better to predict cancer in people who do not have it [False Positive] and perform further analysis than to fail to predict cancer when the patient has the disease [False Negative].

I decided to train the model on the F1 score, to have a model balanced between precision and recall, and to leave the increase of recall performance to the threshold selection.

Moreover, the model has been trained on the macro average F1 score, which is a single performance indicator obtained as the mean of the F1 scores of the individual classes.

\[ F1\ macro\ avg = \frac{\sum_{i=1}^{n} F1_i}{n} \]

It is useful with imbalanced classes, because it weights each class equally and is not influenced by the number of samples in each class. This is set both in config.metrics and in find_optimal_threshold_cv.
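
A short worked example may help to see why the macro average treats classes equally. The per-label counts below are hypothetical (one frequent label and one rare label), and the micro average is included only as a contrast.

```python
def f1(tp, fp, fn):
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there is nothing to score
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-label counts as (TP, FP, FN)
counts = {"frequent": (90, 10, 10), "rare": (1, 3, 1)}

# Macro average: compute F1 per label, then take the plain mean
per_label = {name: f1(*c) for name, c in counts.items()}
macro_f1 = sum(per_label.values()) / len(per_label)

# Micro average: pool the counts first, then compute a single F1
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)

print(per_label)
print(f"macro F1 = {macro_f1:.4f}, micro F1 = {micro_f1:.4f}")
```

The weak score on the rare label pulls the macro average down as much as a weak score on the frequent label would, whereas the micro average is dominated by the frequent label.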

5.7.2.1 f1_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_f1, best_score_f1 = config.find_optimal_threshold_cv(ytrue, y_pred_proba, f1_score)

print(f"Optimal threshold: {optimal_threshold_f1}")
Optimal threshold: 0.15000000000000002
Code
print(f"Best score: {best_score_f1}")
Best score: 0.4788653077945807
Code

# Use the optimal threshold to make predictions
final_predictions_f1 = (y_pred_proba >= optimal_threshold_f1).astype(int)

Optimal threshold f1 score: 0.15. Best score: 0.4788653.

5.7.2.2 recall_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_recall, best_score_recall = config.find_optimal_threshold_cv(ytrue, y_pred_proba, recall_score)

# Use the optimal threshold to make predictions
final_predictions_recall = (y_pred_proba >= optimal_threshold_recall).astype(int)

Optimal threshold recall: 0.05. Best score: 0.8095814.

5.7.2.3 roc_auc_score

Code
ytrue = ytest.astype(int)
y_pred_proba = predictions
optimal_threshold_roc, best_score_roc = config.find_optimal_threshold_cv(ytrue, y_pred_proba, roc_auc_score)

print(f"Optimal threshold: {optimal_threshold_roc}")
Optimal threshold: 0.05
Code
print(f"Best score: {best_score_roc}")
Best score: 0.8809499649742268
Code

# Use the optimal threshold to make predictions
final_predictions_roc = (y_pred_proba >= optimal_threshold_roc).astype(int)

Optimal threshold roc: 0.05. Best score: 0.88095.

5.7.3 Confusion Matrix Plot

Code
# convert probability predictions to predictions
ypred = predictions >=  optimal_threshold_recall # .05
ypred = ypred.astype(int)

# create a plot with 3 by 2 subplots
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
axes = axes.flatten()
mcm = multilabel_confusion_matrix(ytrue, ypred)
# plot the confusion matrices for each label
for i, (cm, label) in enumerate(zip(mcm, config.labels)):
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot(ax=axes[i], colorbar=False)
    axes[i].set_title(f"Confusion matrix for label: {label}")
plt.tight_layout()
plt.show()
Figure 4: Multi Label Confusion matrix

5.8 Classification Report

Code

cr = classification_report(
  ytrue,
  ypred,
  target_names=config.labels,
  digits=4,
  output_dict=True
  )
df_cr = pd.DataFrame.from_dict(cr).reset_index()
Code
library(reticulate)
df_cr <- py$df_cr %>% dplyr::rename(names = index)
cols <- df_cr %>% colnames()
df_cr %>% 
  pivot_longer(
    cols = -names,
    names_to = "metrics",
    values_to = "values"
  ) %>% 
  pivot_wider(
    names_from = names,
    values_from = values
  )
Table 6: Classification report
# A tibble: 10 × 5
   metrics       precision recall `f1-score` support
   <chr>             <dbl>  <dbl>      <dbl>   <dbl>
 1 toxic            0.552  0.890      0.682     2262
 2 severe_toxic     0.236  0.917      0.375      240
 3 obscene          0.550  0.936      0.692     1263
 4 threat           0.0366 0.493      0.0681      69
 5 insult           0.471  0.915      0.622     1170
 6 identity_hate    0.116  0.720      0.200      207
 7 micro avg        0.416  0.896      0.569     5211
 8 macro avg        0.327  0.812      0.440     5211
 9 weighted avg     0.495  0.896      0.629     5211
10 samples avg      0.0502 0.0848     0.0597    5211

6 Conclusions

The BiLSTM model, optimized for high recall, performs well enough to make predictions for each label. Considering the low support for the threat label, its performance is not bad. See Table 2 and Figure 1: the threat label is only 0.27 % of the observations. The model has been optimized for recall because the cost of not identifying an injurious comment as such is higher than the cost of considering a clean comment as injurious.

Possible improvements include increasing the number of observations, especially for the threat label. In general there are too many clean comments. This could be addressed by undersampling the clean comments, which I explicitly avoided in order to check the performance of the BiLSTM on an imbalanced dataset, leveraging the class weight method.